In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Preparing data

Loading the review spreadsheet (TSV) with a two-level index

In [2]:
base = pd.read_csv("../data/InterfaceReview-May2019.tsv", sep="\t", index_col= [0,1], skipinitialspace=True)
In [3]:
base.head()
Out[3]:
Austrian Newspapers Online (ANNO) Ancestry British Newspaper Archives California Digital Newspaper Collection (CDNC) Chronicling America Colorado Historical Newspaper Collection (CHNC) Delpher DigiPress DIFMOE E-luxemburgensia ... Georgia Historic Newspapers Libraria - Ukrainian online periodicals archive New York Times POLONA Retronews Scriptorium StaBi Tessmann Le Temps archives Trove
interface URL http://anno.onb.ac.at/ https://www.newspapers.com/ https://www.britishnewspaperarchive.co.uk/ https://cdnc.ucr.edu/cgi-bin/cdnc http://chroniclingamerica.loc.gov https://www.coloradohistoricnewspapers.org/ https://www.delpher.nl/nl/kranten/ https://digipress.digitale-sammlungen.de/ https://www.difmoe.eu/d/ http://www.eluxemburgensia.lu/ ... https://gahistoricnewspapers.galileo.usg.edu/ https://libraria.ua/en/ https://timesmachine.nytimes.com/browser https://polona.pl/ http://www.retronews.fr/ https://scriptorium.bcu-lausanne.ch/ http://zefys.staatsbibliothek-berlin.de/ http://digital.tessmann.it https://www.letempsarchives.ch/ http://trove.nla.gov.au/newspaper/?q=
Target area Austria (and former AH empire) US, UK, AUS, CAN British isles California, US United States Colorado, US Netherland and its former colonies Bavaria, Germany East and Central Europe Luxembourg ... Georgia, US Ukraine NY Times Poland France Vaud, Switzerland Germany Tyrol (Italy) Vaud, Switzerland Australia
Creator National library of Austria Ancestry.com British Library California State Library, National Endowment f... Library of Congress Colorado State Library, History Colorado, Coll... Royal Dutch Library National Library of Bavaria (BSB) international cooperation, lead by Germany (I... National Library of Luxembourg ... Digital Library of Georgia (DLG) Archival Information System New York Times Polish public libraries BnF Partenariats Bibliothèque Cantonale et Universitaire de Lau... Berlin State Library Provincial library of Bozen Le Temps, Swiss National Library, EPFL National Library of Australia
Purpose and scope Collection of historical newspaper and journal... Collection of EN-speaking newspapers, primary ... collection of British newspapers To digitize California newspapers for the Nati... Collection of selected American newspapers dig... The long-term goal for CHNC is to provide acce... Access to digitized texts incl. books, journal... since 1997, the BSB has been digitising its co... gather digitised materials, not only newspaper... digitized collections of the National Library ... ... Georgia Newspapers - Sharing Georgia's history... The goal of the project is to digitize and pro... Access to NYT archives give access to the digitised objects of the li... Explorer le passé pour mieux lire le présent. ... Scriptorium met à disposition du public des co... digitized collection of the Berliner City Libr... Collection from various cultural institutions ... Proposer les collections numérisées des trois ... -
Approximate date of creation 2003 u u 2008 u ca 2005 ongoing ca 2016 u u ... 2007 2012 u u 2016 2012 u u 2016 2007

5 rows × 24 columns

Testing some indexing

In [ ]:
base.loc['apis','IIIF Image API']
In [ ]:
base.loc['information on digitization','OCR confidence scores'].describe()
In [ ]:
base.loc['newspaper metadata','Place of publication']
In [ ]:
# counts will not work for categorical data
base.loc['newspaper collection', 'Languages of the collections']
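
The lookups above rely on the two-level row index (feature family, then feature). A minimal sketch on a hypothetical miniature of the sheet, where the column names, the second API feature, and the values are all invented for illustration, shows the two access patterns:

```python
import io
import pandas as pd

# Hypothetical miniature of the review sheet: two index columns
# (feature family, feature) followed by one column per interface.
tsv = (
    "family\tfeature\tInterface A\tInterface B\n"
    "apis\tIIIF Image API\ty\tn\n"
    "apis\tLink to source code\tn\tn\n"
    "newspaper metadata\tPlace of publication\ty\ty\n"
)
toy = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=[0, 1])

# A (family, feature) tuple selects one row, returned as a Series
# indexed by interface name. The trailing ":" makes it explicit that
# the tuple addresses rows, not a (row, column) pair.
row = toy.loc[("apis", "IIIF Image API"), :]
print(row.to_dict())

# A family name alone selects every feature in that family.
family = toy.loc["apis"]
print(list(family.index))
```

Writing the key as a tuple followed by `:` avoids the ambiguity of the shorthand `toy.loc['apis', 'IIIF Image API']`, which pandas could in general read as a row label and a column label.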

Do some cleaning

Strip leading and trailing whitespace from string values, to be safe

In [4]:
def trim_all_columns(df):
    """
    Trim whitespace from ends of each value across all series in dataframe
    """
    trim_strings = lambda x: x.strip() if type(x) is str else x
    return df.applymap(trim_strings)

# trim
base = trim_all_columns(base)
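
Recent pandas releases deprecate `DataFrame.applymap` in favour of `DataFrame.map`. A version-neutral sketch of the same trimming, applied column by column with `Series.map`, could look like this (the helper name `strip_strings` and the toy data are assumptions, not part of the notebook):

```python
import pandas as pd

def strip_strings(df):
    """Return a copy of df with surrounding whitespace removed from string cells."""
    def strip_cell(value):
        # Leave non-string cells (numbers, NaN) untouched.
        return value.strip() if isinstance(value, str) else value
    return df.apply(lambda col: col.map(strip_cell))

# Hypothetical data with stray spaces and one non-string cell.
toy = pd.DataFrame({"Interface A": [" y", "n "], "Interface B": ["u", 3]})
clean = strip_strings(toy)
print(clean.values.tolist())
```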

Remove rows that are not strictly binary (y/n)

In [5]:
base = base.drop('interface', level=0)
base = base.drop('newspaper collection', level=0)
base = base.drop('Other', level=1)
base = base.drop('Languages of the collections', level=1)
base = base.drop('Download options (file formats)', level=1)

Replace n and y with 0 and 1 (ideally this would be cleaned up in the spreadsheet itself)

In [6]:
base = base.replace(to_replace=['y', 'y?', 'y (annotations)', 'y (requires user account - free)', 'n', '?', 'u', 'n?'], value=[1,1,1,1,0,0,0,0])
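
A list-based `replace` silently leaves any label it does not know about in place. As a hedged alternative, the sketch below expresses the same convention as a dictionary and uses `Series.map`, which turns unknown labels into NaN so they can be spotted. The toy frame and the label "maybe" are invented for illustration.

```python
import pandas as pd

# Hypothetical answers for two interfaces; "maybe" stands in for a label
# that the mapping below does not anticipate.
answers = pd.DataFrame(
    {"Interface A": ["y", "n?", "y (annotations)"],
     "Interface B": ["u", "maybe", "y"]},
    index=["feature 1", "feature 2", "feature 3"],
)

# Same convention as the cell above: any "y" variant counts as 1,
# everything uncertain or negative counts as 0.
to_binary = {"y": 1, "y?": 1, "y (annotations)": 1,
             "y (requires user account - free)": 1,
             "n": 0, "n?": 0, "?": 0, "u": 0}

# Series.map returns NaN for labels missing from the dictionary.
binary = answers.apply(lambda col: col.map(to_binary))

# List the labels that were not covered by the mapping.
unmapped = sorted(set(answers.to_numpy().ravel()) - set(to_binary))
print(unmapped)
```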
In [7]:
base.head()
Out[7]:
Austrian Newspapers Online (ANNO) Ancestry British Newspaper Archives California Digital Newspaper Collection (CDNC) Chronicling America Colorado Historical Newspaper Collection (CHNC) Delpher DigiPress DIFMOE E-luxemburgensia ... Georgia Historic Newspapers Libraria - Ukrainian online periodicals archive New York Times POLONA Retronews Scriptorium StaBi Tessmann Le Temps archives Trove
newspaper metadata Alternative titles, succeeding titles, related titles 1 0 1 1 1 0 0 1 0 1 ... 1 0 0 1 1 0 1 1 0 1
Place of publication 1 1 1 1 1 1 1 1 1 1 ... 1 1 0 1 1 0 1 1 0 1
Geographic coverage 0 1 1 0 1 0 1 1 1 0 ... 1 0 0 0 0 0 1 1 0 0
Publisher 1 0 1 0 1 0 1 1 1 1 ... 1 0 0 1 1 0 1 0 0 0
Date range 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 0 1

5 rows × 24 columns

Radar factory

In [8]:
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, RegularPolygon
from matplotlib.path import Path
from matplotlib.projections.polar import PolarAxes
from matplotlib.projections import register_projection
from matplotlib.spines import Spine
from matplotlib.transforms import Affine2D

def radar_factory(num_vars, frame='circle'):
    """Create a radar chart with `num_vars` axes.

    This function creates a RadarAxes projection and registers it.

    Parameters
    ----------
    num_vars : int
        Number of variables for radar chart.
    frame : {'circle' | 'polygon'}
        Shape of frame surrounding axes.

    """
    # calculate evenly-spaced axis angles
    theta = np.linspace(0, 2*np.pi, num_vars, endpoint=False)

    class RadarAxes(PolarAxes):

        name = 'radar'
        # use 1 line segment to connect specified points
        RESOLUTION = 1

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # rotate plot such that the first axis is at the top
            self.set_theta_zero_location('N')

        def fill(self, *args, closed=True, **kwargs):
            """Override fill so that line is closed by default"""
            return super().fill(closed=closed, *args, **kwargs)

        def plot(self, *args, **kwargs):
            """Override plot so that line is closed by default"""
            lines = super().plot(*args, **kwargs)
            for line in lines:
                self._close_line(line)
            # return the lines so callers can keep a handle on them
            return lines

        def _close_line(self, line):
            x, y = line.get_data()
            # FIXME: markers at x[0], y[0] get doubled-up
            if x[0] != x[-1]:
                x = np.concatenate((x, [x[0]]))
                y = np.concatenate((y, [y[0]]))
                line.set_data(x, y)

        def set_varlabels(self, labels, fontsize):
            self.set_thetagrids(np.degrees(theta), labels, fontsize=fontsize)

        def _gen_axes_patch(self):
            # The Axes patch must be centered at (0.5, 0.5) and of radius 0.5
            # in axes coordinates.
            if frame == 'circle':
                return Circle((0.5, 0.5), 0.5)
            elif frame == 'polygon':
                return RegularPolygon((0.5, 0.5), num_vars,
                                      radius=.5, edgecolor="k")
            else:
                raise ValueError("unknown value for 'frame': %s" % frame)

        def _gen_axes_spines(self):
            if frame == 'circle':
                return super()._gen_axes_spines()
            elif frame == 'polygon':
                # spine_type must be 'left'/'right'/'top'/'bottom'/'circle'.
                spine = Spine(axes=self,
                              spine_type='circle',
                              path=Path.unit_regular_polygon(num_vars))
                # unit_regular_polygon gives a polygon of radius 1 centered at
                # (0, 0) but we want a polygon of radius 0.5 centered at (0.5,
                # 0.5) in axes coordinates.
                spine.set_transform(Affine2D().scale(.5).translate(.5, .5)
                                    + self.transAxes)
                return {'polar': spine}
            else:
                raise ValueError("unknown value for 'frame': %s" % frame)

    register_projection(RadarAxes)
    return theta
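
Two ideas in the factory can be checked without drawing anything: the axes sit at evenly spaced angles, and a line is closed by repeating its first point at the end. A small NumPy-only sketch, with made-up values for four variables:

```python
import numpy as np

num_vars = 4
# One angle per variable; endpoint=False avoids duplicating 0 and 2*pi.
theta = np.linspace(0, 2 * np.pi, num_vars, endpoint=False)

# Hypothetical scores, one per variable.
values = np.array([3.0, 1.0, 2.0, 5.0])

# Closing the polygon: append the first point again, as _close_line does.
closed_theta = np.concatenate((theta, [theta[0]]))
closed_values = np.concatenate((values, [values[0]]))

print(np.degrees(theta))
print(closed_values)
```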
In [162]:
def build_single_radar(labels, values, title, grid, figure_title):
    N = len(labels)
    theta = radar_factory(N, frame='polygon')
    
    fig, ax = plt.subplots(figsize=(10,10), subplot_kw=dict(projection='radar'))
    fig.subplots_adjust(top=0.85, bottom=0.05)

    #ax.set_rgrids([2, 4, 6, 8])
    ax.set_rgrids(grid, labels=[str(i) for i in grid], size='large')
    ax.set_title(title,  position=(0.5, 1.1), ha='center')

    for d in values:
        line = ax.plot(theta, d)
        ax.fill(theta, d,  alpha=0.25)
    ax.set_varlabels(labels, fontsize=12)
    
    if figure_title is None:
        plt.show()
    else:
        # `quality` only ever applied to JPEG output and is no longer accepted by savefig
        plt.savefig(f'../charts/{figure_title}.pdf', format='pdf')
In [138]:
def build_multiple_radar(labels, values, titles, grid, figure_title):
    N = len(labels)
    theta = radar_factory(N, frame='polygon')
        
    fig, axes = plt.subplots(figsize=(120, 80), nrows=4, ncols=6,
                             subplot_kw=dict(projection='radar'))
    
    fig.subplots_adjust(wspace=0.50, hspace=0.20, top=0.85, bottom=0.05)

    for ax, case_data, title in zip(axes.flatten(), values, titles):
        #ax.set_rgrids(['2', '4', '6', '8'])
        #ax.set_rgrids(grid, labels=[str(i) for i in grid], size='large')
        ax.set_ylim(0, 30)
        ax.set_title(title, weight='bold', fontsize=42, position=(0.5, 1.1),
                     horizontalalignment='center', verticalalignment='center')
        line = ax.plot(theta, case_data)
        ax.fill(theta, case_data,  alpha=0.25)
        ax.set_varlabels(labels, fontsize=32)

    if figure_title is None:
        plt.show()
    else:
        # `quality` only ever applied to JPEG output and is no longer accepted by savefig
        plt.savefig(f'../charts/{figure_title}.pdf', format='pdf')

Interface charts

Counts

Sum the values over level 1 to get a 'grade' for each interface per feature family

In [122]:
level_0 = base.groupby(level=0).sum()
In [124]:
level_0 = level_0.reindex(["newspaper metadata", 
                 "apis",
                 "connectivity",
                 "information on digitization",
                 "enrichment",
                 "user interaction",
                 "viewer",
                 "result display",
                 "result filtering",
                 "result sorting",
                 "search",
                 "browsing"
                    ])
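
The grouping and reordering steps can be sketched on a toy frame. The feature name "Browse by date" and all values are invented. Note that `groupby` returns the families in alphabetical order, which is why an explicit `reindex` is needed to impose a chosen order.

```python
import pandas as pd

# Hypothetical binary feature table with a (family, feature) index.
index = pd.MultiIndex.from_tuples([
    ("search", "Basic keyword search"),
    ("search", "Phrase search"),
    ("apis", "IIIF Image API"),
    ("browsing", "Browse by date"),
])
toy = pd.DataFrame({"Interface A": [1, 1, 0, 1],
                    "Interface B": [1, 0, 1, 0]}, index=index)

# Summing within each family counts the features each interface offers.
scores = toy.groupby(level=0).sum()
print(list(scores.index))  # families come back sorted alphabetically

# Impose the order wanted around the radar chart.
ordered = scores.reindex(["search", "browsing", "apis"])
print(ordered)
```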
In [125]:
level_0.head()
Out[125]:
Austrian Newspapers Online (ANNO) Ancestry British Newspaper Archives California Digital Newspaper Collection (CDNC) Chronicling America Colorado Historical Newspaper Collection (CHNC) Delpher DigiPress DIFMOE E-luxemburgensia ... Georgia Historic Newspapers Libraria - Ukrainian online periodicals archive New York Times POLONA Retronews Scriptorium StaBi Tessmann Le Temps archives Trove
newspaper metadata 9 3 5 6 12 2 6 7 9 6 ... 10 4 1 7 7 3 7 8 0 7
apis 0 0 0 0 1 0 1 1 0 0 ... 2 0 0 0 0 0 1 0 1 1
connectivity 0 0 0 0 2 0 1 0 0 0 ... 0 0 0 0 3 0 0 0 0 0
information on digitization 2 0 3 2 3 1 1 1 1 3 ... 1 4 0 1 2 1 0 1 1 3
enrichment 1 0 0 1 0 1 1 0 0 0 ... 0 0 0 0 5 0 0 2 0 1

5 rows × 23 columns

Observations per feature family

In [126]:
# Total per row, i.e. per feature family (=> how well do all interfaces cover a given aspect):
level_0.loc[:,'Total'] = level_0.sum(axis=1)
In [127]:
level_0['Total (%)'] = level_0['Total']/level_0['Total'].sum() * 100
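
A sketch of the same two derived columns on invented scores. It keeps the list of interface columns in a variable (`interfaces`, a name introduced here) so that re-running the cell would not add the `Total` column into the sum a second time.

```python
import pandas as pd

# Hypothetical per-family scores for two interfaces.
scores = pd.DataFrame(
    {"Interface A": [2, 0, 1], "Interface B": [1, 1, 0]},
    index=["search", "apis", "browsing"],
)
interfaces = list(scores.columns)  # remember the original columns

# Row total: how many features all interfaces offer in this family.
scores["Total"] = scores[interfaces].sum(axis=1)

# Share of each family in the grand total, in percent.
scores["Total (%)"] = scores["Total"] / scores["Total"].sum() * 100
print(scores["Total (%)"].round(1).to_dict())
```

Selecting the derived column by name (`scores[["Total (%)"]]`) is also less fragile than a positional slice if the number of interfaces changes.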

Radar/Star/Spider with just one interface

In [128]:
# getting the labels
labels = level_0.index
labels
Out[128]:
Index(['newspaper metadata', 'apis', 'connectivity',
       'information on digitization', 'enrichment', 'user interaction',
       'viewer', 'result display', 'result filtering', 'result sorting',
       'search', 'browsing'],
      dtype='object')
In [129]:
# Get all rows, just for the first column, and transpose (for the radar factory)
level_0.iloc[:12, :1].T
Out[129]:
newspaper metadata apis connectivity information on digitization enrichment user interaction viewer result display result filtering result sorting search browsing
Austrian Newspapers Online (ANNO) 9 0 0 2 1 0 5 2 5 4 9 3
In [130]:
# take only the values
first_interface = level_0.iloc[:12, :1].T.values
In [131]:
build_single_radar(labels, values=first_interface, title=level_0.columns[0], grid=[2,4,6,8], figure_title='Anno')

Radar view for each interface in one figure

In [132]:
# take the data: all rows and the first 23 columns (selected explicitly in case Total columns have been added)
all_interfaces_counts = level_0.iloc[:12, :23].T.values

# same with percentages
level_0_percent = base.groupby(level=0).sum().apply(lambda x: 100*x/float(x.sum()))
In [133]:
level_0_percent.head()
Out[133]:
Austrian Newspapers Online (ANNO) Ancestry British Newspaper Archives California Digital Newspaper Collection (CDNC) Chronicling America Colorado Historical Newspaper Collection (CHNC) Delpher DigiPress DIFMOE E-luxemburgensia ... Georgia Historic Newspapers Libraria - Ukrainian online periodicals archive New York Times POLONA Retronews Scriptorium StaBi Tessmann Le Temps archives Trove
apis 0.0 0.000000 0.000000 0.000000 2.631579 0.0 1.960784 2.5 0.0 0.000000 ... 6.060606 0.000000 0.0 0.000000 0.000000 0.0 4.545455 0.000000 5.0 2.040816
browsing 7.5 9.677419 5.882353 8.333333 2.631579 7.5 0.000000 5.0 7.5 6.896552 ... 12.121212 11.538462 0.0 5.405405 4.918033 0.0 9.090909 2.380952 0.0 10.204082
connectivity 0.0 0.000000 0.000000 0.000000 5.263158 0.0 1.960784 0.0 0.0 0.000000 ... 0.000000 0.000000 0.0 0.000000 4.918033 0.0 0.000000 0.000000 0.0 0.000000
enrichment 2.5 0.000000 0.000000 2.083333 0.000000 2.5 1.960784 0.0 0.0 0.000000 ... 0.000000 0.000000 0.0 0.000000 8.196721 0.0 0.000000 4.761905 0.0 2.040816
information on digitization 5.0 0.000000 8.823529 4.166667 7.894737 2.5 1.960784 2.5 2.5 10.344828 ... 3.030303 15.384615 0.0 2.702703 3.278689 4.0 0.000000 2.380952 5.0 6.122449

5 rows × 23 columns

In [134]:
# check that every column sums to 100
level_0_percent.sum()
Out[134]:
Austrian Newspapers Online (ANNO)                  100.0
Ancestry                                           100.0
British Newspaper Archives                         100.0
California Digital Newspaper Collection (CDNC)     100.0
Chronicling America                                100.0
Colorado Historical Newspaper Collection (CHNC)    100.0
Delpher                                            100.0
DigiPress                                          100.0
DIFMOE                                             100.0
E-luxemburgensia                                   100.0
E-newspaperarchives                                100.0
The European Library (TEL)                         100.0
L'Express                                          100.0
Georgia Historic Newspapers                        100.0
Libraria - Ukrainian online periodicals archive    100.0
New York Times                                     100.0
POLONA                                             100.0
Retronews                                          100.0
Scriptorium                                        100.0
StaBi                                              100.0
Tessmann                                           100.0
Le Temps archives                                  100.0
Trove                                              100.0
dtype: float64
In [135]:
level_0_percent.max().max()
Out[135]:
31.818181818181817
In [136]:
# groupby sorts the families alphabetically; restore the order used for `labels`
all_interfaces_percents = level_0_percent.reindex(labels).iloc[:12, :23].T.values
In [139]:
build_multiple_radar(labels, all_interfaces_percents, level_0.columns[:23], [10, 20, 30], 'all-interfaces-single')
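
The radar helpers pair the i-th value with the i-th label, so the value matrix has to follow the same row order as the labels. A minimal sketch with invented counts shows the column-wise percentage step and the alignment before transposing:

```python
import numpy as np
import pandas as pd

# Hypothetical counts per family, indexed alphabetically as groupby returns them.
counts = pd.DataFrame(
    {"Interface A": [1, 3], "Interface B": [2, 2]},
    index=["apis", "search"],
)

# Order in which the axes should appear around the chart.
labels = ["search", "apis"]

# Each column is rescaled so that it sums to 100.
percent = counts.apply(lambda col: 100 * col / col.sum())

# Align rows with the labels, then transpose: one row of values per interface.
values = percent.reindex(labels).T.values
print(values)
```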

Global radar view (i.e. for all interfaces) over all features

In [140]:
# take only the last column: Total per feature family, in percent
values = level_0.iloc[:,24:].T.values
In [141]:
values
Out[141]:
array([[16.12121212,  1.21212121,  0.84848485,  4.24242424,  1.57575758,
         8.72727273, 13.09090909,  7.75757576, 13.21212121,  7.75757576,
        19.51515152,  5.93939394]])
In [142]:
build_single_radar(labels, values, "All interfaces",grid=[5,10,15,20],figure_title='all-interfaces-global')

Global radar view focusing on Search (where there are a lot of features)

In [143]:
search = base.loc['search'].copy()
search.loc[:,'Total'] = search.sum(axis=1)
search['Total (%)'] = search['Total']/search['Total'].sum() * 100
In [144]:
search.head()
Out[144]:
Austrian Newspapers Online (ANNO) Ancestry British Newspaper Archives California Digital Newspaper Collection (CDNC) Chronicling America Colorado Historical Newspaper Collection (CHNC) Delpher DigiPress DIFMOE E-luxemburgensia ... New York Times POLONA Retronews Scriptorium StaBi Tessmann Le Temps archives Trove Total Total (%)
Basic keyword search 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 24.0 14.457831
Query autocomplete 0 0 0 0 0 0 0 0 0 0 ... 1 1 0 0 1 1 0 0 5.0 3.012048
Boolean operators (AND, OR, NOT) 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 0 1 1 1 21.0 12.650602
Phrase search 1 1 0 1 1 1 1 1 1 0 ... 0 0 1 0 0 1 1 1 14.0 8.433735
Fuzzy search 0 0 0 0 0 0 1 0 0 1 ... 0 0 1 1 1 1 0 0 7.0 4.216867

5 rows × 26 columns

In [145]:
labels_search = search.index
In [146]:
labels_search
Out[146]:
Index(['Basic keyword search', 'Query autocomplete',
       'Boolean operators (AND, OR, NOT)', 'Phrase search', 'Fuzzy search',
       'Wild card', 'Proximity search (near operator)', 'Limit the date range',
       'Limit by language', 'Limit newspaper title(s)',
       'Limit the place of publication',
       'Limit by newspaper thematic (from metadata)',
       'Limit newspaper segments / zones', 'Limit by article category',
       'Limit article length', 'Limit by archival holder / library',
       'Limit by license / accessibility', 'Query suggestion',
       'Search by named entities'],
      dtype='object')
In [158]:
values_search = search.iloc[:,25:].T.values
In [159]:
values_search
Out[159]:
array([[14.45783133,  3.01204819, 12.65060241,  8.43373494,  4.21686747,
         4.21686747,  3.01204819, 13.25301205,  3.01204819,  9.63855422,
         7.22891566,  2.40963855,  6.62650602,  1.20481928,  0.60240964,
         2.40963855,  2.40963855,  0.60240964,  0.60240964]])
In [163]:
build_single_radar(labels_search, values_search, title="Search", grid=[2,4,6,8,10,12,14], figure_title="search")